
For each term in Eq. 3.112, we have:

\[
\frac{\partial L_S}{\partial k^{l,i}_n}
= \frac{\partial L_S}{\partial \hat{k}^{l,i}_n}
  \frac{\partial \hat{k}^{l,i}_n}{\partial \bigl(w^l \circ k^{l,i}_n\bigr)}
  \frac{\partial \bigl(w^l \circ k^{l,i}_n\bigr)}{\partial k^{l,i}_n}
= \frac{\partial L_S}{\partial \hat{k}^{l,i}_n}
  \circ \mathbf{1}_{-1 \le w^l \circ k^{l,i}_n \le 1} \circ w^l,
\tag{3.113}
\]

\[
\frac{\partial L_B}{\partial k^{l,i}_n}
= \lambda \Bigl\{ w^l \circ \bigl(w^l \circ k^{l,i}_n - \hat{k}^{l,i}_n\bigr) \Bigr\}
+ \nu \Bigl[ (\sigma^l_i)^{-2} \circ \bigl(k^l_{i+} - \mu^l_{i+}\bigr)
           + (\sigma^l_i)^{-2} \circ \bigl(k^l_{i-} + \mu^l_{i-}\bigr) \Bigr],
\tag{3.114}
\]

where $\mathbf{1}$ is the indicator function that is widely used to estimate the gradient of nondifferentiable parameters [199], and $(\sigma^l_i)^{-2}$ is a vector whose elements are all equal to $(\sigma^l_i)^{-2}$.
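
To make the per-kernel update concrete, the following is a minimal NumPy sketch of Eqs. 3.113 and 3.114. The function name, the flattened array layout, and the reading of the prior term (positive entries pulled toward $+\mu^l_i$, negative entries toward $-\mu^l_i$) are illustrative assumptions, not the book's implementation.

```python
import numpy as np

def grad_k(grad_LS_khat, w, k, k_hat, mu, sigma, lam, nu):
    """Sketch of Eqs. 3.113-3.114: gradient of L w.r.t. one kernel k^{l,i}_n.

    grad_LS_khat : dL_S / d k_hat for this kernel (same shape as k)
    w            : modulation weights w^l (broadcastable to k)
    k, k_hat     : real-valued kernel and its reconstruction
    mu, sigma    : shared Gaussian mean and std for this group (scalars)
    lam, nu      : balancing hyperparameters lambda and nu
    """
    # Eq. 3.113: straight-through estimator; the gradient passes only where
    # the indicator 1_{-1 <= w * k <= 1} is active.
    ste_mask = (np.abs(w * k) <= 1.0).astype(k.dtype)
    d_LS = grad_LS_khat * ste_mask * w

    # Eq. 3.114: Bayesian kernel-loss term (assumed reading): reconstruction
    # part plus a prior that pulls positive entries toward +mu and negative
    # entries toward -mu, scaled by sigma^{-2}.
    recon = lam * w * (w * k - k_hat)
    pos = (k >= 0).astype(k.dtype)
    prior = nu * sigma**-2 * (pos * (k - mu) + (1 - pos) * (k + mu))

    return d_LS + recon + prior
```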

Updating $w^l$: Unlike the forward process, $w^l$ is used in backpropagation to calculate the gradients. This process is similar to the way $\hat{x}$ is calculated from $x$ asynchronously. Specifically, $\delta_{w^l}$ is composed of the following two parts:

\[
\delta_{w^l} = \frac{\partial L}{\partial w^l}
= \frac{\partial L_S}{\partial w^l} + \frac{\partial L_B}{\partial w^l}.
\tag{3.115}
\]

For each term in Eq. 3.115, we have:

\[
\frac{\partial L_S}{\partial w^l}
= \sum_{i=1}^{I^l} \sum_{n=1}^{N_{I^l}}
  \frac{\partial L_S}{\partial \hat{k}^{l,i}_n}
  \frac{\partial \hat{k}^{l,i}_n}{\partial \bigl(w^l \circ k^{l,i}_n\bigr)}
  \frac{\partial \bigl(w^l \circ k^{l,i}_n\bigr)}{\partial w^l}
= \sum_{i=1}^{I^l} \sum_{n=1}^{N_{I^l}}
  \frac{\partial L_S}{\partial \hat{k}^{l,i}_n}
  \circ \mathbf{1}_{-1 \le w^l \circ k^{l,i}_n \le 1} \circ k^{l,i}_n,
\tag{3.116}
\]

\[
\frac{\partial L_B}{\partial w^l}
= \lambda \sum_{i=1}^{I^l} \sum_{n=1}^{N_{I^l}}
  \bigl(w^l \circ k^{l,i}_n - \hat{k}^{l,i}_n\bigr) \circ k^{l,i}_n.
\tag{3.117}
\]
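
The two parts of $\delta_{w^l}$ can be accumulated over all kernels of a layer as in the following NumPy sketch. The stacked array shapes and the broadcasting of $w^l$ are assumptions made for illustration.

```python
import numpy as np

def grad_w(grad_LS_khat, w, k, k_hat, lam):
    """Sketch of Eqs. 3.115-3.117: gradient of L w.r.t. w^l.

    grad_LS_khat, k, k_hat : arrays of shape (I, N, ...) stacking every
                             kernel of layer l; w broadcasts over them.
    lam                    : hyperparameter lambda of the Bayesian loss
    """
    # Eq. 3.116: chain rule through the STE, summed over kernel indices i, n.
    ste_mask = (np.abs(w * k) <= 1.0).astype(k.dtype)
    d_LS = np.sum(grad_LS_khat * ste_mask * k, axis=(0, 1))

    # Eq. 3.117: reconstruction term of the Bayesian loss.
    d_LB = lam * np.sum((w * k - k_hat) * k, axis=(0, 1))

    # Eq. 3.115: the two parts are simply added.
    return d_LS + d_LB
```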

Updating $\mu^l_i$ and $\sigma^l_i$: Note that the same $\mu^l_i$ and $\sigma^l_i$ are used for each kernel (see Section 3.2), so the gradients here are scalars. The gradients $\delta_{\mu^l_i}$ and $\delta_{\sigma^l_i}$ are calculated as:

\[
\delta_{\mu^l_i} = \frac{\partial L}{\partial \mu^l_i}
= \frac{\partial L_B}{\partial \mu^l_i}
= \frac{\lambda \nu}{C^l_i \times H^l \times W^l}
  \sum_{n=1}^{C^l_i} \sum_{p=1}^{H^l \times W^l}
  \begin{cases}
    (\sigma^l_i)^{-2} \bigl(\mu^l_i - k^{l,i}_{n,p}\bigr), & k^{l,i}_{n,p} \ge 0,\\[4pt]
    (\sigma^l_i)^{-2} \bigl(\mu^l_i + k^{l,i}_{n,p}\bigr), & k^{l,i}_{n,p} < 0,
  \end{cases}
\tag{3.118}
\]

\[
\delta_{\sigma^l_i} = \frac{\partial L}{\partial \sigma^l_i}
= \frac{\partial L_B}{\partial \sigma^l_i}
= \frac{\lambda \nu}{C^l_i \times H^l \times W^l}
  \sum_{n=1}^{C^l_i} \sum_{p=1}^{H^l \times W^l}
  \begin{cases}
    -(\sigma^l_i)^{-3} \bigl(k^{l,i}_{n,p} - \mu^l_i\bigr)^2 + (\sigma^l_i)^{-1}, & k^{l,i}_{n,p} \ge 0,\\[4pt]
    -(\sigma^l_i)^{-3} \bigl(k^{l,i}_{n,p} + \mu^l_i\bigr)^2 + (\sigma^l_i)^{-1}, & k^{l,i}_{n,p} < 0,
  \end{cases}
\tag{3.119}
\]

where $k^{l,i}_{n,p}$, $p \in \{1, \ldots, H^l \times W^l\}$, denotes the $p$-th element of $k^{l,i}_n$. In the fine-tuning process, we update $c_m$ using the same strategy as the center loss [245]. The update of $\sigma_{m,n}$ based on $L_B$ is straightforward and is omitted here for brevity.
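
As a final illustration, the scalar updates of Eqs. 3.118 and 3.119 can be written as the NumPy sketch below. The kernel layout $(C^l_i, H^l, W^l)$ and the sign convention of the two cases follow the assumed reading above and are not the book's reference code.

```python
import numpy as np

def grad_mu_sigma(k_i, mu, sigma, lam, nu):
    """Sketch of Eqs. 3.118-3.119: scalar gradients for mu^l_i and sigma^l_i.

    k_i   : kernels of the i-th group, shape (C, H, W)
    mu    : shared mean mu^l_i (scalar)
    sigma : shared std sigma^l_i (scalar)
    """
    k = k_i.reshape(-1)            # all C^l_i * H^l * W^l elements
    scale = lam * nu / k.size      # lambda * nu / (C^l_i x H^l x W^l)
    pos = k >= 0

    # Eq. 3.118: positive elements use (mu - k), negative elements (mu + k).
    d_mu = scale * np.sum(np.where(pos,
                                   sigma**-2 * (mu - k),
                                   sigma**-2 * (mu + k)))

    # Eq. 3.119: per-element derivative of the (scaled) Gaussian negative
    # log-likelihood with respect to sigma.
    d_sigma = scale * np.sum(np.where(pos,
                                      -sigma**-3 * (k - mu)**2 + 1.0 / sigma,
                                      -sigma**-3 * (k + mu)**2 + 1.0 / sigma))
    return d_mu, d_sigma
```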